In this question, you are allowed to use any method available to you in R, either programmed yourself, or from packages of your choosing. The important thing here is that you communicate your results very well, in terms of language, illustration, and precision.
In this question you will analyse a dataset containing volcanic rocks
and their chemical composition, collected from various sites in New
Zealand. You will predict whether the Source of a given
rock sample is “Okataina” (a geographical location), or whether the
sample comes from somewhere else.
Load the annotated dataset rocks.csv and the unlabelled
dataset rocks_unlabelled.csv. You will find that rows in
the latter set do not contain an entry for Source, i.e.,
they are of unknown origin. Your goal will be to fill these blanks,
using the techniques you have learned in this course so far. The
following guidelines might be useful to consider:
Use a classification-tree-based method (this can also be random forests or boosting), and at least one additional, different data mining algorithm (two tree variants are not considered “different”). Compare their performance in a suitable way.
It will not be enough to just predict some Source
labels for the missing data, but you will need to quantify and justify
the accuracy of your predictions. It is OK if you think that your method
is not perfect, but be sure to point out its limitations in this case.
You will not be judged on accuracy, but on correctness of procedure, and
clarity of communication.
How exactly you prepare and work with the dataset is your choice, but follow good practice models as demonstrated in the course (for example, consider “data hygiene”, i.e., don’t test on training data etc.).
Be brief, exact, correct, and communicate well.
# Load required libraries
library(tidyverse) # For data manipulation and visualization
library(readr) # For reading CSV files
library(caret) # For model training and preprocessing
library(knitr) # For neat table rendering
library(skimr) # For the skim() data summaries used below
# Load labeled and unlabeled datasets
rocks <- read_csv("../data/rocks.csv", col_types = cols(...1 = col_skip()))
rocks_unlabelled <- read_csv("../data/rocks_unlabelled.csv", col_types = cols(...1 = col_skip()))
# Inspect the structure of the datasets
# Check for missing values
skim(rocks)
| Name | rocks |
| Number of rows | 330 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| logical | 1 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| okataina | 0 | 1 | 0.88 | TRU: 291, FAL: 39 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| SiO2 | 0 | 1 | 75.62 | 3.24 | 51.29 | 75.23 | 76.34 | 77.26 | 78.66 | ▁▁▁▁▇ |
| TiO2 | 0 | 1 | 0.21 | 0.15 | 0.10 | 0.12 | 0.18 | 0.24 | 1.08 | ▇▁▁▁▁ |
| Al2O3 | 0 | 1 | 13.52 | 1.17 | 12.18 | 12.73 | 13.24 | 14.01 | 19.04 | ▇▃▁▁▁ |
| Fe2O3 | 0 | 1 | 1.87 | 1.31 | 0.43 | 1.32 | 1.56 | 1.88 | 11.85 | ▇▁▁▁▁ |
| MnO | 0 | 1 | 0.05 | 0.02 | 0.01 | 0.04 | 0.05 | 0.06 | 0.20 | ▇▇▁▁▁ |
| MgO | 0 | 1 | 0.37 | 0.60 | 0.07 | 0.18 | 0.24 | 0.32 | 5.63 | ▇▁▁▁▁ |
| CaO | 0 | 1 | 1.26 | 0.81 | 0.64 | 0.87 | 1.04 | 1.32 | 8.88 | ▇▁▁▁▁ |
| Na2O | 0 | 1 | 3.64 | 0.44 | 1.77 | 3.36 | 3.63 | 3.97 | 4.60 | ▁▁▅▇▃ |
| K2O | 0 | 1 | 3.44 | 0.57 | 0.60 | 3.18 | 3.54 | 3.83 | 5.15 | ▁▁▅▇▁ |
| P2O5 | 0 | 1 | 0.02 | 0.01 | 0.00 | 0.01 | 0.01 | 0.02 | 0.09 | ▇▅▁▁▁ |
| LOI | 0 | 1 | 3.53 | 0.92 | 0.19 | 3.09 | 3.64 | 4.15 | 5.88 | ▁▂▆▇▁ |
skim(rocks_unlabelled)
| Name | rocks_unlabelled |
| Number of rows | 30 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| SiO2 | 2 | 0.93 | 75.61 | 2.01 | 71.15 | 74.15 | 75.97 | 77.34 | 77.88 | ▂▃▂▇▇ |
| TiO2 | 2 | 0.93 | 0.20 | 0.08 | 0.11 | 0.14 | 0.17 | 0.27 | 0.34 | ▇▃▃▂▃ |
| Al2O3 | 2 | 0.93 | 13.79 | 1.17 | 12.45 | 12.75 | 13.67 | 14.42 | 16.63 | ▇▇▃▁▂ |
| Fe2O3 | 2 | 0.93 | 1.68 | 0.43 | 1.10 | 1.36 | 1.61 | 2.01 | 2.51 | ▇▇▃▆▃ |
| MnO | 2 | 0.93 | 0.05 | 0.01 | 0.04 | 0.04 | 0.05 | 0.05 | 0.06 | ▇▁▆▁▃ |
| MgO | 2 | 0.93 | 0.30 | 0.14 | 0.10 | 0.22 | 0.26 | 0.38 | 0.60 | ▂▇▂▁▂ |
| CaO | 2 | 0.93 | 1.31 | 0.53 | 0.76 | 0.94 | 1.08 | 1.83 | 2.55 | ▇▂▁▂▁ |
| Na2O | 2 | 0.93 | 3.48 | 0.29 | 2.93 | 3.27 | 3.54 | 3.63 | 4.15 | ▂▂▇▁▁ |
| K2O | 2 | 0.93 | 3.57 | 0.45 | 2.28 | 3.33 | 3.75 | 3.89 | 4.10 | ▁▁▃▂▇ |
| P2O5 | 2 | 0.93 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.03 | ▇▁▁▁▁ |
| LOI | 2 | 0.93 | 3.77 | 0.68 | 1.67 | 3.34 | 4.03 | 4.18 | 4.73 | ▁▁▅▇▆ |
We imported two datasets:
rocks.csv: labelled training data including the
okataina variable.
rocks_unlabelled.csv: data without labels, to be
predicted later.
# Load required packages
library(randomForest) # For random forest classifier
library(caret) # For cross-validation and evaluation
library(pROC) # For AUC calculation
set.seed(62380486) # For reproducibility
# Recode the logical target variable into valid factor labels
rocks$okataina <- factor(rocks$okataina, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
# Define 10-fold cross-validation control
ctrl <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE, # Needed for AUC
summaryFunction = twoClassSummary, # ROC, Sens, Spec
savePredictions = TRUE
)
# Train Random Forest model
mtry_grid <- expand.grid(mtry = 2:11) # explicitly specified a tuning grid ranging from mtry = 2 to 11
tree.rf <- train(
okataina ~ ., data = rocks,
method = "rf",
metric = "ROC", # AUC is the performance metric
trControl = ctrl,
tuneGrid = mtry_grid # caret's default tune would try only 3 candidate mtry values; we search 2:11 explicitly
)
# Output model summary
print(tree.rf)
## Random Forest
##
## 330 samples
## 11 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 297, 297, 297, 297, 297, 296, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9793103 0.400 1.0000000
## 3 0.9732759 0.450 0.9965517
## 4 0.9750000 0.500 0.9932184
## 5 0.9659770 0.525 0.9897701
## 6 0.9638218 0.500 0.9897701
## 7 0.9634195 0.525 0.9897701
## 8 0.9647126 0.500 0.9897701
## 9 0.9595402 0.525 0.9863218
## 10 0.9565230 0.475 0.9863218
## 11 0.9461782 0.475 0.9863218
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(tree.rf)
The mtry parameter in Random Forest controls the number
of predictors randomly selected at each split.
To explore the effect of the number of predictors at each split
(mtry), we specified a tuning grid from
mtry = 2 to 11. This allowed the model to evaluate all
candidate values through 10-fold cross-validation and select the optimal
setting based on the highest AUC. The expanded search space offers more
detailed insights into model performance and helps avoid suboptimal
defaults.
In this case, the optimal mtry is 2, selected from a
total of p = 11 features. Randomly selecting m of the p features at each
split introduces randomness into the tree-growing process, which
increases diversity among the trees, reduces the correlation between
them, and thereby enhances the ensemble’s ability to reduce variance
(Li, lecture, 2025).
In this model, mtry = 2 resulted in the highest cross-validated ROC
of ~0.98, indicating the best overall discriminative
performance. Therefore, it was chosen as the optimal setting.
# Recompute confusion matrix using best cross-validated predictions
# Note: Use predictions corresponding to the best mtry model (mtry = 2)
# Filter predictions to match selected tuning parameters
best_pred <- tree.rf$pred %>%
filter(mtry == tree.rf$bestTune$mtry)
# Generate confusion matrix
conf_matrix <- confusionMatrix(
data = best_pred$pred,
reference = best_pred$obs,
positive = "Yes" # "Yes" is the target class (Okataina)
)
# Display confusion matrix
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 15 0
## Yes 24 291
##
## Accuracy : 0.9273
## 95% CI : (0.8937, 0.9528)
## No Information Rate : 0.8818
## P-Value [Acc > NIR] : 0.004544
##
## Kappa : 0.5243
##
## Mcnemar's Test P-Value : 2.668e-06
##
## Sensitivity : 1.0000
## Specificity : 0.3846
## Pos Pred Value : 0.9238
## Neg Pred Value : 1.0000
## Prevalence : 0.8818
## Detection Rate : 0.8818
## Detection Prevalence : 0.9545
## Balanced Accuracy : 0.6923
##
## 'Positive' Class : Yes
##
# Compute ROC curve
roc_curve <- roc(
response = best_pred$obs,
predictor = best_pred$Yes, # predicted probability for class "Yes"
levels = c("No", "Yes"),
direction = "<"
)
# Plot ROC curve
plot(roc_curve,
col = "blue",
lwd = 2,
main = "Random Forest ROC Curve (10-fold CV)")
abline(a = 0, b = 1, lty = 2, col = "gray") # diagonal reference line
# Report AUC value
auc(roc_curve)
## Area under the curve: 0.9767
# Variable importance from the caret-trained model
varImpPlot(tree.rf$finalModel, main = "Variable Importance - Random Forest")
# Alternatively, get numeric importance values
importance_df <- as.data.frame(varImp(tree.rf)$importance)
importance_df <- importance_df %>%
rownames_to_column("Feature") %>%
arrange(desc(Overall))
# Pretty barplot using ggplot2
library(ggplot2)
ggplot(importance_df, aes(x = reorder(Feature, Overall), y = Overall)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Feature Importance in Random Forest Model",
x = "Feature", y = "Importance (Gini-based)") +
theme_minimal()
We visualized feature importance using both varImpPlot()
from the randomForest package and a custom
ggplot2 barplot based on caret::varImp().
While both methods rank features similarly, they differ in the scale of
the x-axis.
varImpPlot() reports raw MeanDecreaseGini scores,
reflecting the average reduction in node impurity attributable to each
variable across the ensemble. In contrast, caret::varImp()
returns normalized importance scores on a common scale, which enhances
interpretability and aesthetic consistency for visual comparison.
Despite the difference in scale, both plots consistently identify the top predictors, reinforcing their significance in the classification task.
Random Forest provides an intrinsic measure of feature importance by evaluating the decrease in node impurity (e.g., Gini index) attributable to each feature across all trees. Features that contribute to more effective splits—especially near the root nodes—are ranked higher.
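As a numeric illustration of this impurity-decrease idea, with made-up toy counts rather than values from the rocks data:

```r
# Gini impurity of a two-class node with proportion p of the positive class
gini <- function(p) 1 - p^2 - (1 - p)^2

# Toy split: a parent with 10 Yes / 10 No sent to two children of 10 each
parent   <- gini(10 / 20)                      # 0.50 (maximally impure)
left     <- gini(8 / 10)                       # 0.32 (8 Yes / 2 No)
right    <- gini(2 / 10)                       # 0.32 (2 Yes / 8 No)
weighted <- (10 / 20) * left + (10 / 20) * right
decrease <- parent - weighted                  # 0.18: this split's contribution
decrease
```

Averaging such decreases over every split that uses a given feature, across all trees, gives that feature’s MeanDecreaseGini.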
The importance plot above reveals that K2O and
CaO are the most influential in distinguishing between
Okataina and non-Okataina rocks. These variables likely exhibit
substantial distributional differences between the classes or interact
strongly with other predictors. In contrast, features with low
importance may be either weakly informative or redundant in the presence
of others.
Such insights can inform geochemical domain knowledge and guide further feature selection or dimensionality reduction.
The confusion matrix summarizes the classification performance based
on the best cross-validated model. While the model achieves perfect
sensitivity, the specificity is considerably low (~0.38),
indicating that many non-Okataina samples are misclassified as Okataina.
This suggests that the Random Forest model is overly eager to assign the
positive class.
Additionally, in this application we care particularly about specificity, which measures the proportion of correctly identified non-Okataina samples. In the context of geological classification, false positives (i.e., misclassifying a non-Okataina rock as Okataina) may lead to incorrect provenance attribution and flawed geological interpretations. Prioritizing high specificity ensures that when the model predicts a rock as coming from Okataina, it does so with high confidence, minimizing spurious identifications.
The low specificity observed in this model may be a result of class imbalance: the okataina variable is skewed, with approximately 88% ‘Yes’ and only 12% ‘No’. This imbalance could bias the model toward predicting the dominant class, reducing its ability to correctly identify the minority class.
table(rocks$okataina)
##
## No Yes
## 39 291
prop.table(table(rocks$okataina))
##
## No Yes
## 0.1181818 0.8818182
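One common mitigation, sketched here in base R on a mock label vector with the same 291/39 imbalance (caret can also apply this inside cross-validation via `trainControl(sampling = "down")`):

```r
set.seed(62380486)
# Mock labels reproducing the 291 Yes / 39 No imbalance of the rocks data
labels <- factor(rep(c("Yes", "No"), times = c(291, 39)))

# Downsample the majority class to the minority class size
n_min    <- min(table(labels))
balanced <- unlist(lapply(split(seq_along(labels), labels),
                          function(idx) sample(idx, n_min)))

# Both classes now contribute 39 samples each
table(labels[balanced])
```

Downsampling discards information, so it is a trade-off rather than a free fix; threshold tuning (used later in this report) is an alternative that keeps all the data.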
Before selecting an alternative modeling approach to Random Forest, we briefly compare several widely used non-tree classifiers. Each has distinct assumptions and strengths that make them suitable for different types of data.
| Method | Assumptions | Strengths | Limitations | Scale Sensitivity |
|---|---|---|---|---|
| kNN | No parametric assumption; relies on distance | Simple, non-parametric, interpretable | Sensitive to irrelevant features and scaling | ✅ Yes |
| QDA/LDA | Gaussian distribution; equal/unequal covariance | Efficient with small data; interpretable decision boundary | Strong assumptions; sensitive to outliers | ✅ Yes |
| SVM | Maximizes margin between classes | Effective in high-dimensional space; robust to overfitting | Computationally expensive; less interpretable | ✅ Yes |
| Logistic Regression | Linear decision boundary; independent features | Probabilistic output; interpretable coefficients | Struggles with non-linear boundaries | ✅ Yes |
k-Nearest Neighbors (kNN) makes no distributional assumptions and is ideal for exploratory use, but it is highly affected by the scale of input features and irrelevant variables.
Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) assume normality and are powerful when those assumptions hold, though they are sensitive to violations such as multicollinearity or class imbalance.
Support Vector Machines (SVM) are robust and powerful for both linear and non-linear boundaries (via kernels), but their output is less interpretable and requires careful tuning.
Logistic Regression is a solid baseline method that provides interpretable coefficients and class probabilities, but performs poorly with non-linear class boundaries. All the above models require feature standardization, as they rely on distance or assume standardized inputs.
Given the goals of explainability, of choosing a method whose assumptions differ clearly from tree ensembles, and of evaluating it under the same cross-validation protocol, we apply the k-NN method in this case.
# Define the name of the response variable to avoid hardcoding
label_name <- "okataina"
# Create a preprocessing object that centers and scales all predictor variables.
# This step computes the mean and standard deviation for each numeric feature.
# It excludes the response variable ("okataina") from transformation.
preproc_knn <- preProcess(
rocks[, setdiff(names(rocks), label_name)],
method = c("center", "scale")
)
# Apply the preprocessing model to the predictors to standardize them.
# Each feature is rescaled to have mean = 0 and standard deviation = 1.
rocks_knn_scaled <- predict(
preproc_knn,
rocks[, setdiff(names(rocks), label_name)]
)
# Append the original response variable ("okataina") back to the processed data
# so that the full dataset can be used for classification modeling.
rocks_knn_scaled[[label_name]] <- rocks[[label_name]]
All predictor variables were standardized using Z-score normalization, which ensures that each feature has zero mean and unit variance. This step is essential for k-Nearest Neighbors (kNN), which is sensitive to the scale of input variables. The response variable “okataina” was excluded from this transformation and appended back afterward for modeling.
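A toy illustration of why this matters for distance-based methods (the feature values below are made up for the sketch):

```r
# Three hypothetical samples; SiO2 lives on a much larger scale than MnO
s1 <- c(SiO2 = 75.0, MnO = 0.05)
s2 <- c(SiO2 = 76.0, MnO = 0.05)   # differs from s1 only in SiO2
s3 <- c(SiO2 = 75.0, MnO = 0.20)   # differs from s1 only in MnO (4x larger!)

d12 <- sqrt(sum((s1 - s2)^2))      # 1.00 -- the SiO2 difference dominates
d13 <- sqrt(sum((s1 - s3)^2))      # 0.15 -- the big relative MnO shift looks tiny

# After z-scoring each column, the two differences carry equal weight
m    <- scale(rbind(s1, s2, s3))
d12s <- sqrt(sum((m[1, ] - m[2, ])^2))
d13s <- sqrt(sum((m[1, ] - m[3, ])^2))
c(unscaled_12 = d12, unscaled_13 = d13, scaled_12 = d12s, scaled_13 = d13s)
```

Without standardization, kNN would effectively ignore trace oxides such as MnO and P2O5 in favour of SiO2.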
# Load required libraries
library(caret)
library(pROC)
# Set seed for reproducibility
set.seed(62380486)
# Define 10-fold cross-validation control
ctrl_knn <- trainControl(
method = "cv", # k-fold cross-validation
number = 10, # 10 folds
classProbs = TRUE, # Needed for ROC and AUC
summaryFunction = twoClassSummary, # Evaluation metric: ROC, Sensitivity, Specificity
savePredictions = TRUE # Store predictions for later analysis
)
# Train kNN model using caret
knn_model <- train(
okataina ~ ., data = rocks_knn_scaled,
method = "knn",
metric = "ROC", # Use AUC as selection metric
trControl = ctrl_knn,
tuneLength = 10 # Try 10 different k values automatically
)
# Display model summary
print(knn_model)
## k-Nearest Neighbors
##
## 330 samples
## 11 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 297, 297, 297, 297, 297, 296, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.8682615 0.4750000 0.9862069
## 7 0.8737213 0.3666667 0.9896552
## 9 0.8971552 0.2916667 0.9896552
## 11 0.9016523 0.2416667 0.9931034
## 13 0.8957328 0.2166667 1.0000000
## 15 0.8955891 0.2166667 1.0000000
## 17 0.8846839 0.1666667 1.0000000
## 19 0.8740086 0.1000000 1.0000000
## 21 0.8671121 0.1000000 1.0000000
## 23 0.8700000 0.0750000 1.0000000
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
# Plot ROC vs k
plot(knn_model, main = "kNN Cross-validated AUC vs k")
From the model summary above, the best value is k = 11.
# Extract predictions from best-tuned model
best_knn_pred <- knn_model$pred %>%
filter(k == knn_model$bestTune$k)
# Generate confusion matrix using cross-validated predictions
conf_matrix_knn <- confusionMatrix(
data = best_knn_pred$pred,
reference = best_knn_pred$obs,
positive = "Yes"
)
print(conf_matrix_knn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 9 2
## Yes 30 289
##
## Accuracy : 0.903
## 95% CI : (0.8659, 0.9327)
## No Information Rate : 0.8818
## P-Value [Acc > NIR] : 0.1325
##
## Kappa : 0.3249
##
## Mcnemar's Test P-Value : 1.815e-06
##
## Sensitivity : 0.9931
## Specificity : 0.2308
## Pos Pred Value : 0.9060
## Neg Pred Value : 0.8182
## Prevalence : 0.8818
## Detection Rate : 0.8758
## Detection Prevalence : 0.9667
## Balanced Accuracy : 0.6119
##
## 'Positive' Class : Yes
##
The kNN model was trained using 10-fold cross-validation, with
automatic tuning over a range of k values. The optimal number of
neighbors was k = 11, selected for the highest cross-validated
ROC AUC (~0.90).
Evaluation of the best model yielded sensitivity = 0.99 and
specificity = 0.23, as shown in the confusion matrix.
# Compute ROC curve for best model
roc_knn <- roc(
response = best_knn_pred$obs,
predictor = best_knn_pred$Yes, # Probability of class "Yes"
levels = c("No", "Yes"),
direction = "<" # same convention as the Random Forest ROC above
)
# Plot ROC curve
plot(roc_knn,
col = "darkgreen",
lwd = 2,
main = "kNN ROC Curve (10-fold Cross-validation)")
abline(a = 0, b = 1, lty = 2, col = "gray")
# Output AUC value
auc(roc_knn)
## Area under the curve: 0.8958
The ROC curve confirms strong discriminative power, with an AUC of
0.8958.
Now we put the key performance values of the Random Forest and kNN models side by side.
# Extract best predictions from each model
rf_pred_best <- tree.rf$pred %>%
filter(mtry == tree.rf$bestTune$mtry)
knn_pred_best <- knn_model$pred %>%
filter(k == knn_model$bestTune$k)
# Compute confusion matrices
rf_conf <- confusionMatrix(rf_pred_best$pred, rf_pred_best$obs, positive = "Yes")
knn_conf <- confusionMatrix(knn_pred_best$pred, knn_pred_best$obs, positive = "Yes")
# Build comparison table
comparison <- tibble::tibble(
Model = c("Random Forest", "kNN"),
ROC = c(
max(tree.rf$results$ROC),
max(knn_model$results$ROC)
),
Sensitivity = c(
rf_conf$byClass["Sensitivity"],
knn_conf$byClass["Sensitivity"]
),
Specificity = c(
rf_conf$byClass["Specificity"],
knn_conf$byClass["Specificity"]
)
)
# Round and display
comparison %>%
mutate(across(where(is.numeric), ~ round(.x, 3))) %>%
kable(caption = "Cross-validated Performance of Random Forest and kNN (based on actual predictions)")
| Model | ROC | Sensitivity | Specificity |
|---|---|---|---|
| Random Forest | 0.979 | 1.000 | 0.385 |
| kNN | 0.902 | 0.993 | 0.231 |
The table above compares the performance of the Random Forest and
k-Nearest Neighbors (kNN) classifiers. The Random Forest model achieved
the higher AUC (~0.98), perfect sensitivity (1.00), and
higher specificity (0.39): it is better both at correctly
identifying Okataina samples and at avoiding false positives, although
its specificity is still low in absolute terms.
The kNN model showed slightly lower sensitivity (0.99) and
markedly lower specificity (0.23), meaning it was both
slightly less accurate in detecting Okataina and more prone to
incorrectly labeling non-Okataina samples as Okataina. It assigns the
Okataina label even more freely, which further increases the risk of
false positives.
Given our earlier emphasis on specificity, Random Forest may be preferable.
To address the Random Forest’s low specificity, we can manually adjust the classification threshold to increase our confidence in positive predictions.
By default, classification models assign the “Yes” label (i.e., positive class) when the predicted probability exceeds 0.5. However, when specificity is prioritized—such as in this geological classification task, where false positives may lead to misinterpretation of provenance—it is advisable to raise this threshold. Doing so makes the model more conservative in assigning the “Yes” class, thereby reducing the false positive rate and improving specificity.
This approach represents a post-hoc calibration technique where we trade off sensitivity for higher specificity. We can identify an optimal threshold by evaluating metrics such as the ROC curve, precision-recall tradeoff, or maximizing the Youden index (sensitivity + specificity - 1).
# Define threshold values to evaluate
thresholds <- seq(0.1, 0.9, by = 0.05)
# Initialize empty results frame
youden_df <- data.frame()
# Loop over thresholds to compute sensitivity, specificity, and Youden index
for (t in thresholds) {
pred_label <- ifelse(rf_pred_best$Yes > t, "Yes", "No") %>%
factor(levels = c("No", "Yes"))
cm <- confusionMatrix(pred_label, rf_pred_best$obs, positive = "Yes")
sens <- cm$byClass["Sensitivity"]
spec <- cm$byClass["Specificity"]
youden <- sens + spec - 1
youden_df <- rbind(youden_df, data.frame(
Threshold = t,
Sensitivity = round(sens, 3),
Specificity = round(spec, 3),
YoudenIndex = round(youden, 3)
))
}
# Show table
youden_df
# Get the top 5 thresholds with the highest Youden Index
top5_youden <- youden_df %>%
arrange(desc(YoudenIndex)) %>%
head(5)
# Show the top 5 thresholds
top5_youden
To identify the optimal decision threshold, we evaluated sensitivity, specificity, and the Youden Index across a range of thresholds from 0.1 to 0.9. Increasing the threshold improves specificity while reducing sensitivity.
The Youden Index peaks at thresholds of 0.75 and 0.80, suggesting these provide the best trade-off between true positive and true negative rates. This supports threshold tuning as a post-hoc calibration method when the default cutoff (0.5) does not align with domain-specific priorities.
However, the cross-validated peak of the Youden Index need not transfer exactly to new data, so a small safety margin is sensible. After experimenting with a few values, we set our threshold at 0.70.
# Manually set threshold = 0.70
threshold <- 0.70
# Classify predictions based on threshold
rf_pred_best$custom_pred <- ifelse(rf_pred_best$Yes > threshold, "Yes", "No") %>%
factor(levels = c("No", "Yes"))
# Compute confusion matrix at the 0.70 threshold
cm_70 <- confusionMatrix(rf_pred_best$custom_pred, rf_pred_best$obs, positive = "Yes")
print(cm_70)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 31 4
## Yes 8 287
##
## Accuracy : 0.9636
## 95% CI : (0.9373, 0.9811)
## No Information Rate : 0.8818
## P-Value [Acc > NIR] : 1.239e-07
##
## Kappa : 0.8174
##
## Mcnemar's Test P-Value : 0.3865
##
## Sensitivity : 0.9863
## Specificity : 0.7949
## Pos Pred Value : 0.9729
## Neg Pred Value : 0.8857
## Prevalence : 0.8818
## Detection Rate : 0.8697
## Detection Prevalence : 0.8939
## Balanced Accuracy : 0.8906
##
## 'Positive' Class : Yes
##
Using a manually selected threshold of 0.70, we
recalculated the classification results for the Random Forest model.
This threshold yielded a specificity of 0.79 and a
sensitivity of 0.99—providing a strong balance while
prioritizing low false positive rates.
The ROC curve includes a marker at this threshold, showing the trade-off between sensitivity and specificity. A density plot of predicted probabilities by class further visualizes the separation between “Yes” and “No” samples relative to the threshold.
# Step 1: Remove rows with missing values from the unlabelled dataset
rocks_unlabelled_clean <- na.omit(rocks_unlabelled)
# Step 2: Predict class probabilities using the trained Random Forest model.
# Note: tree.rf was trained on the unscaled predictors, so the kNN scaling
# object (preproc_knn) must NOT be applied before predicting here.
rf_probs <- predict(tree.rf, newdata = rocks_unlabelled_clean, type = "prob")
# Step 3: Apply threshold = 0.70 to assign class labels
rf_pred_class <- ifelse(rf_probs$Yes > 0.70, "Okataina", "Not Okataina")
# Step 4: Combine results with predicted class and probability
rf_pred_result <- rocks_unlabelled_clean %>%
mutate(
Predicted_Source = rf_pred_class,
Probability_Okataina = round(rf_probs$Yes, 3)
)
# Step 5: View the predictions
rf_pred_result[, c("Predicted_Source", "Probability_Okataina")]
# Count predicted categories
table(rf_pred_result$Predicted_Source)
##
## Not Okataina Okataina
## 7 21
# Show proportion
prop.table(table(rf_pred_result$Predicted_Source))
##
## Not Okataina Okataina
## 0.25 0.75
The histogram of predicted probabilities indicates the confidence spread, while the sorted bar chart clearly identifies which samples have the strongest predicted likelihood of being from Okataina. These visualizations support interpretation and downstream selection of high-confidence samples.
The distribution of predicted classes for the unlabelled samples shows that 21 of the 28 complete cases were classified as “Okataina” under the 0.70 threshold (2 of the 30 rows were dropped because of missing values).
While the Random Forest (RF) model achieved high overall
classification performance (cross-validated AUC ~0.98), several
limitations suggest potential concerns regarding its generalizability to
new or unseen data:
Overfitting Risk on Imbalanced Data
The dataset exhibits strong class imbalance, with over 88% of samples
labelled as “Okataina”. Although RF can handle imbalance better than
single trees, the model may still overfit to the majority class, leading
to inflated accuracy and potentially low recall on minority
classes.
Low Specificity in Default Settings
Without manual threshold tuning, the RF model showed near-perfect
sensitivity but very low specificity (≈ 0.38). This behavior indicates
that the model tends to over-predict the positive class, which could
reduce reliability in negative class detection (i.e., identifying
non-Okataina rocks).
Interpretability and Decision Transparency
Despite being more interpretable than neural networks, RF is still an
ensemble of many deep trees. Understanding individual decision paths or
explaining why a specific rock was classified as “Okataina” remains
difficult without post-hoc analysis (e.g., SHAP values).
Sensitivity to Input Noise and Missing Data
During prediction on the unlabeled dataset, several samples had to be
removed due to missing values. RF does not natively handle missing
values unless specifically engineered. This can limit robustness in
real-world scenarios where geochemical data is often
incomplete.
Collinear or Redundant Features
RF has no embedded feature selection: every supplied predictor competes
at each split (subject to mtry sampling). In geochemical datasets, some
oxides are highly collinear, which spreads importance scores across
correlated predictors and leaves the model vulnerable to redundancy or
noise among the features.
While Random Forest is a powerful and flexible classifier, its performance in this case required manual calibration (e.g., threshold tuning) to meet domain-specific goals such as minimizing false positives. For broader deployment or automated applications, additional safeguards—such as feature selection, uncertainty quantification, and external validation—may be necessary to ensure robust generalization.
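One of these safeguards, uncertainty quantification, can be sketched with a simple abstention rule: refuse to label samples whose predicted probability lies within a margin of the decision threshold. The threshold and margin below are illustrative choices, not tuned values.

```r
# Return a class label, or NA ("abstain") when the predicted probability
# is too close to the threshold to call confidently
classify_with_abstain <- function(p_yes, threshold = 0.70, margin = 0.10) {
  label <- ifelse(p_yes > threshold, "Okataina", "Not Okataina")
  label[abs(p_yes - threshold) < margin] <- NA   # too close to call
  label
}

classify_with_abstain(c(0.95, 0.72, 0.65, 0.30))
# "Okataina" NA NA "Not Okataina"
```

Abstained samples could then be routed to an expert for manual provenance assessment instead of receiving a forced label.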
In this question, you are not allowed to use any pre-implemented modules for performing k-means, expectation maximisation, or other clustering algorithms. You will need to provide your own implementation.
We will work with a dataset seeds.csv containing
measurements of a large number of seeds. Each seed is measured in terms
of “Area”, “Perimeter”, “MajorAxisLength”, “ConvexArea”, and other
optical characteristics of this seed. Your goal in this question will be
to find clusters in this dataset, i.e. try and find groups of “different
types of seed” (the seeds come from different plants, but this
information has been lost, so all we can do is try and estimate how many
different types of seeds we have).
Perform data preprocessing, if appropriate.
Investigate the question “how many types of seeds are there”, and present your results briefly, but coherently.
Opportunities for showing extra effort: The additional
dataset seeds_annotated.csv contains (few) datapoints which
have class labels (A,B,…,G), annotated by an expert of seeds. Use this,
in combination with your results from your clustering analysis, to train
a seed type prediction algorithm that takes the measurements of a seed
and predicts its type. Test your algorithm in a suitable way and
communicate your findings (you need to be careful not to check
performance purely on the annotated training set). [You do not need to
implement your algorithms yourself in this part of the question, feel
free to use whatever code works best for you]
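Before turning to the data, here is a minimal from-scratch k-means (Lloyd’s algorithm) in base R of the kind the question asks for. It uses deterministic farthest-point initialisation and omits empty-cluster handling for brevity; the two-blob toy data at the end are made up purely as a sanity check, not taken from seeds.csv.

```r
my_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # Deterministic farthest-point initialisation: start from row 1, then
  # repeatedly add the point farthest from all centres chosen so far
  centers <- X[1, , drop = FALSE]
  while (nrow(centers) < k) {
    d2 <- apply(X, 1, function(p) min(colSums((t(centers) - p)^2)))
    centers <- rbind(centers, X[which.max(d2), ])
  }
  assign_old <- integer(nrow(X))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centre
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    assign_new <- max.col(-d2)                 # index of the nearest centre
    if (all(assign_new == assign_old)) break   # assignments stable: converged
    assign_old <- assign_new
    for (j in seq_len(k))                      # recompute each centre
      centers[j, ] <- colMeans(X[assign_new == j, , drop = FALSE])
  }
  list(cluster = assign_new, centers = centers)
}

# Sanity check on two well-separated toy blobs (20 points each)
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))
fit <- my_kmeans(toy, k = 2)
table(fit$cluster)
```

For the seeds analysis the same function can be run for several k on the standardized measurements, comparing within-cluster sums of squares to judge how many seed types are present.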
# Define the data directory path (relative to the current working directory)
data_dir <- "../data"
# Construct full file paths to the annotated and unlabelled seed datasets
path_annotated <- file.path(data_dir, "seeds_annotated.csv")
path_unlabeled <- file.path(data_dir, "seeds.csv")
# Read the annotated data, assuming decimal commas (",") are used in numeric values
annotated <- read_csv(
path_annotated,
locale = locale(decimal_mark = ","),
show_col_types = FALSE # Suppress column type messages
)
# Read the unlabelled seed data using the same locale settings
unlabeled <- read_csv(
path_unlabeled,
locale = locale(decimal_mark = ","),
show_col_types = FALSE
)
annotated <- select(annotated, -...1) # Remove the unnamed index column ...1
unlabeled <- select(unlabeled, -...1)
skim(unlabeled)
| Name | unlabeled |
| Number of rows | 13511 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| numeric | 16 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Area | 0 | 1 | 53032.20 | 29297.77 | 20420.00 | 36323.50 | 44659.00 | 61289.00 | 254616.00 | ▇▂▁▁▁ |
| Perimeter | 0 | 1 | 855.21 | 214.19 | 524.74 | 703.45 | 794.98 | 976.54 | 1985.37 | ▇▆▁▁▁ |
| MajorAxisLength | 0 | 1 | 320.12 | 85.66 | 183.60 | 253.30 | 296.90 | 376.42 | 738.86 | ▇▆▂▁▁ |
| MinorAxisLength | 0 | 1 | 202.23 | 44.94 | 122.51 | 175.79 | 192.39 | 217.02 | 460.20 | ▇▇▁▁▁ |
| AspectRation | 0 | 1 | 1.58 | 0.25 | 1.02 | 1.43 | 1.55 | 1.71 | 2.43 | ▂▇▅▂▁ |
| Eccentricity | 0 | 1 | 0.75 | 0.09 | 0.22 | 0.72 | 0.76 | 0.81 | 0.91 | ▁▁▂▇▇ |
| ConvexArea | 0 | 1 | 53752.32 | 29748.74 | 20684.00 | 36713.00 | 45200.00 | 62249.00 | 263261.00 | ▇▂▁▁▁ |
| EquivDiameter | 0 | 1 | 253.03 | 59.14 | 161.24 | 215.05 | 238.46 | 279.35 | 569.37 | ▇▆▁▁▁ |
| Extent | 0 | 1 | 0.75 | 0.05 | 0.56 | 0.72 | 0.76 | 0.79 | 0.87 | ▁▁▅▇▂ |
| Solidity | 0 | 1 | 0.99 | 0.00 | 0.92 | 0.99 | 0.99 | 0.99 | 0.99 | ▁▁▁▁▇ |
| roundness | 0 | 1 | 0.87 | 0.06 | 0.49 | 0.83 | 0.88 | 0.92 | 0.99 | ▁▁▂▇▇ |
| Compactness | 0 | 1 | 0.80 | 0.06 | 0.64 | 0.76 | 0.80 | 0.83 | 0.99 | ▂▅▇▂▁ |
| ShapeFactor1 | 0 | 1 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 | ▁▃▇▃▁ |
| ShapeFactor2 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▇▇▇▃▁ |
| ShapeFactor3 | 0 | 1 | 0.64 | 0.10 | 0.41 | 0.58 | 0.64 | 0.70 | 0.97 | ▂▇▇▂▁ |
| ShapeFactor4 | 0 | 1 | 1.00 | 0.00 | 0.95 | 0.99 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
No missing data found in unlabeled.
skim(annotated)
| Name | annotated |
| Number of rows | 100 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 16 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Class | 0 | 1 | 1 | 1 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Area | 0 | 1 | 55220.93 | 32778.32 | 25319.00 | 37335.25 | 44505.00 | 63994.50 | 198671.00 | ▇▃▁▁▁ |
| Perimeter | 0 | 1 | 864.65 | 228.81 | 590.34 | 707.56 | 792.64 | 998.67 | 1686.77 | ▇▅▂▁▁ |
| MajorAxisLength | 0 | 1 | 322.85 | 91.19 | 211.57 | 253.04 | 295.36 | 381.25 | 623.27 | ▇▃▅▁▁ |
| MinorAxisLength | 0 | 1 | 207.27 | 49.35 | 134.45 | 180.20 | 196.41 | 217.85 | 410.10 | ▅▇▁▁▁ |
| AspectRation | 0 | 1 | 1.56 | 0.24 | 1.11 | 1.37 | 1.57 | 1.70 | 2.16 | ▅▅▇▂▂ |
| Eccentricity | 0 | 1 | 0.74 | 0.10 | 0.44 | 0.68 | 0.77 | 0.81 | 0.89 | ▁▂▂▇▆ |
| ConvexArea | 0 | 1 | 55914.22 | 33217.58 | 25664.00 | 37653.00 | 44921.50 | 65094.25 | 201134.00 | ▇▃▁▁▁ |
| EquivDiameter | 0 | 1 | 257.32 | 64.32 | 179.55 | 218.03 | 238.05 | 285.44 | 502.95 | ▇▅▁▁▁ |
| Extent | 0 | 1 | 0.75 | 0.05 | 0.60 | 0.72 | 0.77 | 0.79 | 0.83 | ▁▂▅▇▆ |
| Solidity | 0 | 1 | 0.99 | 0.00 | 0.97 | 0.99 | 0.99 | 0.99 | 0.99 | ▁▁▂▇▆ |
| roundness | 0 | 1 | 0.88 | 0.05 | 0.77 | 0.85 | 0.88 | 0.93 | 0.98 | ▃▃▇▅▅ |
| Compactness | 0 | 1 | 0.81 | 0.06 | 0.68 | 0.76 | 0.80 | 0.85 | 0.95 | ▂▆▇▃▃ |
| ShapeFactor1 | 0 | 1 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 | ▁▃▇▃▁ |
| ShapeFactor2 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▇▆▆▃▂ |
| ShapeFactor3 | 0 | 1 | 0.65 | 0.10 | 0.46 | 0.58 | 0.64 | 0.73 | 0.90 | ▃▇▅▃▂ |
| ShapeFactor4 | 0 | 1 | 1.00 | 0.00 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | ▁▁▂▅▇ |
No missing data found in annotated.
# Select only the numeric columns from the unlabeled dataset
# This is important because scaling only applies to continuous numeric variables
unlabeled_scaled <- select(unlabeled, where(is.numeric)) |>
# Apply standardization (z-score scaling) to each numeric column
# This transforms each feature to have mean = 0 and standard deviation = 1
scale() |>
# Convert the result from a matrix back to a tibble for compatibility with tidyverse
as_tibble()
skim(unlabeled_scaled)
| Name | unlabeled_scaled |
| Number of rows | 13511 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| numeric | 16 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Area | 0 | 1 | 0 | 1 | -1.11 | -0.57 | -0.29 | 0.28 | 6.88 | ▇▂▁▁▁ |
| Perimeter | 0 | 1 | 0 | 1 | -1.54 | -0.71 | -0.28 | 0.57 | 5.28 | ▇▆▁▁▁ |
| MajorAxisLength | 0 | 1 | 0 | 1 | -1.59 | -0.78 | -0.27 | 0.66 | 4.89 | ▇▆▂▁▁ |
| MinorAxisLength | 0 | 1 | 0 | 1 | -1.77 | -0.59 | -0.22 | 0.33 | 5.74 | ▇▇▁▁▁ |
| AspectRation | 0 | 1 | 0 | 1 | -2.26 | -0.61 | -0.13 | 0.50 | 3.43 | ▂▇▅▂▁ |
| Eccentricity | 0 | 1 | 0 | 1 | -5.79 | -0.38 | 0.15 | 0.65 | 1.75 | ▁▁▂▇▇ |
| ConvexArea | 0 | 1 | 0 | 1 | -1.11 | -0.57 | -0.29 | 0.29 | 7.04 | ▇▂▁▁▁ |
| EquivDiameter | 0 | 1 | 0 | 1 | -1.55 | -0.64 | -0.25 | 0.44 | 5.35 | ▇▆▁▁▁ |
| Extent | 0 | 1 | 0 | 1 | -3.96 | -0.63 | 0.21 | 0.76 | 2.37 | ▁▁▅▇▂ |
| Solidity | 0 | 1 | 0 | 1 | -14.55 | -0.32 | 0.24 | 0.62 | 1.62 | ▁▁▁▁▇ |
| roundness | 0 | 1 | 0 | 1 | -6.44 | -0.69 | 0.17 | 0.73 | 1.97 | ▁▁▂▇▇ |
| Compactness | 0 | 1 | 0 | 1 | -2.58 | -0.61 | 0.02 | 0.56 | 3.04 | ▂▅▇▂▁ |
| ShapeFactor1 | 0 | 1 | 0 | 1 | -3.36 | -0.59 | 0.07 | 0.63 | 3.45 | ▁▃▇▃▁ |
| ShapeFactor2 | 0 | 1 | 0 | 1 | -1.93 | -0.94 | -0.04 | 0.76 | 3.27 | ▇▇▇▃▁ |
| ShapeFactor3 | 0 | 1 | 0 | 1 | -2.36 | -0.63 | -0.01 | 0.53 | 3.35 | ▂▇▇▂▁ |
| ShapeFactor4 | 0 | 1 | 0 | 1 | -10.84 | -0.31 | 0.30 | 0.64 | 1.07 | ▁▁▁▁▇ |
The reason we apply z-score standardisation is that clustering algorithms such as k-means are distance-based (typically Euclidean). If one variable has a much larger scale than the others (e.g., values in the hundreds vs. around 0.1), it will dominate the distance metric and bias the clustering results.
Z-score standardisation ensures that all features contribute comparably to the clustering process.
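A tiny illustration (hypothetical values, not taken from the seed data) of why an unscaled feature dominates Euclidean distance:

```r
# Three observations with one large-scale and one small-scale feature
m <- rbind(a = c(100, 0.10), b = c(200, 0.20), c = c(150, 0.15))
dist(m)          # distances are driven almost entirely by the first column
dist(scale(m))   # after z-scoring, both features contribute equally
```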
# Compute squared distance from each row of X to a single centroid
delta_to_centroid <- function(X, centroid) {
n <- nrow(X)
p <- ncol(X)
rowSums((X - matrix(centroid, n, p, byrow = TRUE))^2) # ensure conformable dimensions
}
# Compute a distance matrix from all points to all centroids
delta_matrix <- function(X, centroids) {
k <- nrow(centroids)
sapply(seq_len(k), function(i) delta_to_centroid(X, centroids[i, ]))
}
# Compute total within-cluster sum of squares using f_kmeans output
compute_wss <- function(k, data, max_iter, start) {
model <- f_kmeans(data, k, max_iter = max_iter, start = start)
X <- as.matrix(data) # ensure matrix format
clusters <- model$clusters # vector of cluster assignments
centroids <- model$centroids # matrix of final centroids
total_withinss <- sum(sapply(seq_len(k), function(j) {
idx <- which(clusters == j) # indices assigned to cluster j
if (length(idx) > 0) {
point_set <- X[idx, , drop = FALSE] # all points in cluster j
centroid_row <- matrix(centroids[j, , drop = FALSE],
nrow = length(idx),
ncol = ncol(X),
byrow = TRUE) # broadcast centroid for subtraction
sum(rowSums((point_set - centroid_row)^2)) # WCSS for cluster j
} else {
0
}
}))
return(total_withinss)
}
# Custom K-means implementation
# Note: the `start` argument is accepted for interface symmetry with
# stats::kmeans but is currently unused; the internal set.seed() makes
# every call deterministic, so repeated restarts would be identical.
f_kmeans <- function(data, k, max_iter = 100, start = 10) {
set.seed(82171165)
X <- as.matrix(data) # force to matrix
n <- nrow(X)
p <- ncol(X)
centroids <- X[sample(n), , drop = FALSE][1:k, ] # random initial centroids
clusters <- integer(n) # empty cluster assignment
iter_reached <- NA
for (iter in seq_len(max_iter)) {
distances <- delta_matrix(X, centroids) # distance matrix
clusters_new <- max.col(-distances) # assign nearest centroid
has_converged <- !anyNA(clusters_new) && all(clusters_new == clusters)
if (has_converged) {
iter_reached <- iter
break
}
clusters <- clusters_new
# Recompute centroids
for (j in seq_len(k)) {
idx <- which(clusters == j)
if (length(idx) > 0) {
centroids[j, ] <- colMeans(X[idx, , drop = FALSE]) # new centroid
}
}
}
# Recompute WCSS at the end with safe dimension handling
total_withinss <- sum(sapply(seq_len(k), function(j) {
idx <- which(clusters == j)
if (length(idx) > 0) {
point_set <- X[idx, , drop = FALSE]
centroid_row <- matrix(centroids[j, , drop = FALSE],
nrow = length(idx),
ncol = ncol(X),
byrow = TRUE)
sum(rowSums((point_set - centroid_row)^2))
} else {
0
}
}))
# Return result object
list(
clusters = clusters,
centroids = centroids,
iter = iter_reached,
total_withinss = total_withinss
)
}
Custom Implementation of the K-means Clustering Algorithm:
Computes squared Euclidean distances between data points and centroids.
Assigns each data point to the nearest cluster.
Recomputes centroids as the mean of assigned points.
Iterates the above steps until convergence or a maximum number of iterations is reached.
Computation of Within-Cluster Sum of Squares (WSS):
For each specified number of clusters k, the algorithm computes the WSS.
WSS measures the compactness of clusters — lower values indicate tighter clusters.
Preparation for Determining Optimal Cluster Number:
WSS values for a range of k (typically 2 to 10) are stored.
These values can be visualised in an elbow plot to help determine the optimal number of clusters using the elbow method.
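As a quick sanity check (a sketch on synthetic data, assuming `f_kmeans()` from above is in scope), the custom implementation can be compared against the reference `stats::kmeans`; the multi-start reference should achieve a WCSS no worse than the single-start custom run:

```r
set.seed(42)
toy <- matrix(rnorm(300), ncol = 3)                # 100 synthetic 3-D points
res_custom <- f_kmeans(toy, k = 3, max_iter = 100)
res_ref    <- kmeans(toy, centers = 3, nstart = 25)
# A large gap between the two WCSS values would indicate a bug
c(custom = res_custom$total_withinss, reference = res_ref$tot.withinss)
```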
Since the full dataset contains more than 10,000 rows, we first run a quick test on a small subset before the full analysis.
This code block verifies the fundamental decomposition identity in clustering: \[\text{Total Sum of Squares (TotSS)} = \text{Between-Cluster SS (BSS)} + \text{Within-Cluster SS (WSS)}\]
It runs the custom f_kmeans() implementation for several values of k (from 2 to 10), and for each:
Calculates TotSS directly from the full dataset.
Computes WSS from the K-means result.
Derives BSS by subtraction.
Checks whether the identity approximately holds, and collects all values into a structured table.
This acts as a diagnostic step to validate the correctness of the K-means implementation.
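The same identity can also be verified directly against the reference implementation `stats::kmeans`, which reports all three quantities (a quick check on synthetic data, independent of the seed dataset):

```r
set.seed(1)
toy <- matrix(rnorm(200), ncol = 2)   # 100 synthetic 2-D points
km  <- kmeans(toy, centers = 3, nstart = 10)
# TotSS = BetSS + WSS should hold up to floating-point tolerance
all.equal(km$totss, km$betweenss + km$tot.withinss)
```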
# Define a flag to control whether to run a test mode with a subset of data or the full mode. This is useful for quick debugging.
testing <- FALSE
# Conditionally sets up a test configuration with fewer data rows and smaller k values if testing = TRUE, or sets full run parameters otherwise.
#
# unlabeled_use is the standardised data passed to the clustering algorithm.
#
# k_vals defines the range of clusters to evaluate.
#
# max_iter limits the maximum number of K-means iterations per run.
if (testing) {
message("🟡 Quick test: 500 rows, k=2:4, iter.max=5")
unlabeled_use <- unlabeled_scaled[1:500,]; k_vals <- 2:4; max_iter <- 5
} else {
message("🟢 Full run: all rows, k=2:10, iter.max=100")
unlabeled_use <- unlabeled_scaled; k_vals <- 2:10; max_iter <- 100
}
# Number of random starts passed through to compute_wss()
start <- 10
# Compute the Within-Cluster Sum of Squares (WSS) for each k using the custom
# compute_wss() function and store the results in a numeric vector.
# vapply() is used instead of sapply() for stricter type safety.
wss_values <- vapply(k_vals, function(k) compute_wss(k, unlabeled_use, max_iter, start), numeric(1))
This chunk prepares clustering evaluation by:
Running the custom K-means algorithm across a range of k values.
Measuring the WSS (compactness) for each.
Preparing the results for plotting the elbow curve to determine the optimal number of clusters.
# This chunk verifies the decomposition identity:
# Total Sum of Squares = Within-Cluster SS + Between-Cluster SS
debug_tbl <- purrr::map_dfr(k_vals, function(k) {
# Compute within-cluster sum of squares for this k
# (compute_wss() runs the custom K-means internally)
tot_within <- compute_wss(k, unlabeled_use, max_iter, start)
# Compute the overall (grand) mean vector of the entire dataset
grand_mean <- colMeans(unlabeled_use)
# Compute total sum of squares (TotSS)
# This measures how far each data point is from the grand mean
totss <- sum(rowSums(
(as.matrix(unlabeled_use) -
matrix(grand_mean,
nrow = nrow(unlabeled_use),
ncol = ncol(unlabeled_use),
byrow = TRUE))^2
))
# Compute between-cluster sum of squares (BetSS)
# This is the difference between total and within-cluster variation
betweenss <- totss - tot_within
# Return a tibble with current k and diagnostic values
# identity_ok checks that TotSS ≈ WSS + BSS (with tiny numerical tolerance)
tibble(
k = k,
tot_within = tot_within,
totss = totss,
betweenss = betweenss,
identity_ok = abs(totss - (betweenss + tot_within)) < 1e-8
)
})
# Display the full diagnostic table for all tested k values
print(debug_tbl)
## # A tibble: 9 × 5
## k tot_within totss betweenss identity_ok
## <int> <dbl> <dbl> <dbl> <lgl>
## 1 2 128933. 216160. 87227. TRUE
## 2 3 109560. 216160. 106600. TRUE
## 3 4 85203. 216160. 130957. TRUE
## 4 5 61469. 216160. 154691. TRUE
## 5 6 55144. 216160. 161016. TRUE
## 6 7 48475. 216160. 167685. TRUE
## 7 8 46292. 216160. 169868. TRUE
## 8 9 42756. 216160. 173404. TRUE
## 9 10 40916. 216160. 175244. TRUE
The Elbow Method is based on the idea of diminishing
returns in clustering quality. As the number of clusters k
increases, the total within-cluster sum of squares
(WSS) always decreases—since each point is closer to its
cluster center. However, this improvement is not linear. Initially,
adding clusters greatly reduces WSS, but after a certain point, the
marginal gain becomes minimal. This point of inflection, where the curve
“bends” like an elbow, indicates a suitable number of clusters that
balances compactness with model simplicity.
# Set up clustering input data and the range of k (number of clusters)
unlabeled_use <- unlabeled_scaled
k_vals <- 2:10
max_iter <- 100
start <- 10
# Compute WSS for each k using the custom K-means function
# This gives the total within-cluster sum of squares for Elbow plot
wss_values <- vapply(
k_vals,
function(k) compute_wss(k, unlabeled_use, max_iter, start),
numeric(1)
)
# Create Elbow Plot (WSS vs. k)
plot_wss <- ggplot(data.frame(k = k_vals, wss = wss_values), aes(x = k, y = wss)) +
geom_point() + geom_line() +
labs(x = "Number of Clusters", y = "Total WCSS", title = "Elbow Method") +
theme_minimal() # qplot() is deprecated in recent ggplot2, so we build the plot explicitly
The Silhouette Method
The silhouette method, on the other hand, evaluates the cohesion and separation of clusters from a geometric perspective. For each data point, the silhouette score compares:
\(a(i)\): the average distance to other points in the same cluster (intra-cluster distance)
\(b(i)\): the average distance to points in the nearest neighbouring cluster (inter-cluster distance)
The silhouette score is defined as:
\[s(i)=\frac{b(i)-a(i)}{\max \{a(i), b(i)\}}\] Values close to 1 indicate well-separated and compact clusters, while values near 0 suggest overlapping clusters. The average silhouette width across all points serves as a global score of clustering quality. The k value that maximises this average silhouette score is often considered optimal.
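A small worked example (toy coordinates, not from the seed data) shows how \(s(i)\) is computed by hand and checks it against cluster::silhouette():

```r
library(cluster)
# Four points in two clearly separated clusters
pts <- matrix(c(0, 0,  0, 1,    # cluster 1
                5, 0,  5, 1),   # cluster 2
              ncol = 2, byrow = TRUE)
cl <- c(1, 1, 2, 2)
# For point 1: a(1) = distance to point 2; b(1) = mean distance to points 3 and 4
a1 <- 1
b1 <- mean(c(sqrt(25), sqrt(26)))
s1 <- (b1 - a1) / max(a1, b1)
sil <- silhouette(cl, dist(pts))
all.equal(unname(sil[1, "sil_width"]), s1)  # the two computations should agree
```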
Combining the Elbow and Silhouette methods allows us to cross-validate the choice of k, balancing interpretability (elbow) against structure quality (silhouette).
# Compute average silhouette width for each k
# Silhouette width measures how well points fit within their cluster vs. the next closest cluster
# (silhouette() below is provided by the cluster package)
diss_matrix <- dist(unlabeled_use) # Compute Euclidean distance matrix once
silhouette_values <- vapply(k_vals, function(k) {
model <- f_kmeans(unlabeled_use, k, max_iter = max_iter, start = start) # run clustering
if (!is.null(model$clusters)) {
sil <- silhouette(model$clusters, diss_matrix) # silhouette object
mean(sil[, 3]) # extract average silhouette width
} else {
NA_real_ # in case of error or no clusters
}
}, numeric(1))
# Create Silhouette Plot (Avg. silhouette width vs. k)
plot_sil <- ggplot(data.frame(k = k_vals, sil = silhouette_values), aes(x = k, y = sil)) +
geom_point() + geom_line() +
labs(x = "Number of Clusters", y = "Average Silhouette Width", title = "Silhouette Method") +
theme_minimal() # built with ggplot() directly, as qplot() is deprecated
# Show the silhouette plot followed by the elbow plot
plot_sil
Elbow and Silhouette plots for determining optimal k
plot_wss
Elbow and Silhouette plots for determining optimal k
The silhouette method gives a quality score (higher is better).
The elbow method helps locate the “knee point” where WSS reduction starts to slow down.
Running both provides a cross-validation of the optimal number of clusters.
The total within-cluster sum of squares (WSS) decreases as the number of clusters (k) increases, which is expected. The “elbow” appears around k = 5 or k = 7: before k = 5 the reduction in WSS is sharp, and k = 7 marks a second, weaker bend. After k = 5 the gains in compactness diminish, indicating diminishing returns. This suggests that 5 or 7 clusters may strike a good balance between underfitting and overfitting.
The average silhouette width is highest at k = 2, with a value close to 0.4. However, the silhouette score drops significantly at k = 3, then peaks again slightly at k = 5, before steadily declining. A local maximum at k = 5 suggests that this value produces relatively well-separated and compact clusters compared to neighbouring options. Although k = 2 achieves the highest silhouette score, it may be too coarse (only 2 clusters). The local peak at k = 5 offers a more meaningful trade-off between separation and granularity.
Taken together, these diagnostics suggest a final choice of k = 5.
# ══════════════════════════════════════════════════════════════
# Assign inferred class labels to each cluster by comparing centroid proximity
# ══════════════════════════════════════════════════════════════
# Step 0: Run K-means clustering with k = 5 (determined earlier from Elbow/Silhouette)
res_final <- f_kmeans(unlabeled_scaled, k = 5, max_iter = 100)
# ──────────────────────────────────────────────────────────────
# Step 1: Compute cluster centroids in the scaled unlabeled dataset
cluster_centroids <- unlabeled_scaled %>%
as.data.frame() %>%
mutate(cluster = res_final$clusters) %>% # Assign cluster labels
group_by(cluster) %>%
summarise(across(everything(), mean), .groups = "drop") # Mean per cluster
# ──────────────────────────────────────────────────────────────
# Step 2: Compute centroids of known classes in scaled annotated data
# Caveat: scale() here uses the annotated set's own means/SDs; strictly,
# the centring and scaling parameters of the unlabeled set should be
# reused so that both centroid sets live in the same standardised space.
annotated_scaled <- annotated %>%
select(-Class) %>%   # drop the label before scaling
scale() %>%
as.data.frame()
# Bind class labels back after scaling
annotated_centroids <- annotated_scaled %>%
mutate(Class = annotated$Class) %>%
group_by(Class) %>%
summarise(across(everything(), mean), .groups = "drop")
# ──────────────────────────────────────────────────────────────
# Step 3: Calculate Euclidean distances between cluster and class centroids
cluster_matrix <- as.matrix(select(cluster_centroids, -cluster))
class_matrix <- as.matrix(select(annotated_centroids, -Class))
distance_matrix <- as.matrix(dist(rbind(cluster_matrix, class_matrix)))
# Extract k × m distance submatrix: rows = clusters, columns = classes
k <- nrow(cluster_matrix)
m <- nrow(class_matrix)
prox_matrix <- distance_matrix[1:k, (k + 1):(k + m)]
# ──────────────────────────────────────────────────────────────
# Step 4: Assign each cluster to the closest known class
closest_class_index <- apply(prox_matrix, 1, which.min) # for each cluster, find closest class
closest_class_labels <- annotated_centroids$Class[closest_class_index]
# Create a named vector: cluster ID → class label
cluster_to_label <- setNames(as.character(closest_class_labels), cluster_centroids$cluster)
# Add assigned class labels to each unlabeled sample (based on its cluster)
unlabeled_use <- unlabeled_scaled %>%
as.data.frame() %>%
mutate(cluster = res_final$clusters) %>%
mutate(label = cluster_to_label[as.character(cluster)])
This chunk aligns unsupervised cluster assignments with known class labels by comparing centroids in the standardised feature space. Each cluster is matched to the closest annotated class using Euclidean distance between centroids, enabling interpretable labeling of previously unlabeled data. This provides a bridge between unsupervised clustering and semi-supervised classification.
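The matching logic can be illustrated on toy centroids (hypothetical 2-D coordinates; labels A–C are placeholders, not the real seed classes):

```r
# Two cluster centroids and three class centroids in the same feature space
clusters <- rbind(c(0, 0), c(10, 10))
classes  <- rbind(A = c(1, 1), B = c(9, 9), C = c(20, 20))
D <- as.matrix(dist(rbind(clusters, classes)))
prox <- D[1:2, 3:5]                         # rows = clusters, cols = classes
rownames(classes)[apply(prox, 1, which.min)]
# cluster 1 matches "A", cluster 2 matches "B"
```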
This part outlines the implementation of Principal Component Analysis (PCA) on the standardised feature set of the unlabelled dataset to reduce dimensionality and visualise clustering results. The first three principal components (PC1, PC2, PC3) are extracted and plotted in a 3D scatter plot, with each data point coloured according to its inferred class label.
# Perform Principal Component Analysis (PCA) on the scaled unlabeled dataset
# (PCA() is provided by the FactoMineR package)
pca_result <- PCA(unlabeled_scaled, graph = FALSE)
# Extract coordinates of individuals on the first 3 principal components
pca_coords <- as.data.frame(pca_result$ind$coord[, 1:3])
colnames(pca_coords) <- c("PC1", "PC2", "PC3")
# Assign predicted class labels to each observation for colouring
pca_coords$Class <- unlabeled_use$label
# Define a discrete colour palette (brewer.pal() is from the RColorBrewer package)
pal <- brewer.pal(max(3, length(unique(pca_coords$Class))), "Set2")
# Create an interactive 3D scatter plot of the first three principal components
plot_ly(data = pca_coords,
x = ~PC1, y = ~PC2, z = ~PC3,
color = ~Class, colors = pal,
type = "scatter3d", mode = "markers",
marker = list(size = 3)) |>
layout(title = "PCA – Colored by Assigned Class Labels",
legend = list(title = list(text = "Class")),
scene = list(
xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")
))
# Perform HCPC (Hierarchical Clustering on Principal Components) and 3D visualisation
if (!exists("res.pca") || !exists("res.hcpc")) {
hcpc_data <- scale(unlabeled) # Standardise the original unlabeled data
res.pca <- PCA(hcpc_data, graph = FALSE) # Run PCA on scaled data (no graph output)
res.hcpc <- HCPC(res.pca, graph = FALSE) # Run HCPC on PCA results (no graph output)
}
# Extract coordinates of the first three principal components
pca_coords <- as.data.frame(res.pca$ind$coord[, 1:3])
colnames(pca_coords) <- c("PC1", "PC2", "PC3")
# Add cluster information (from earlier k-means) for colouring
pca_coords$cluster <- factor(unlabeled_use$cluster)
# Create colour palette
pal <- brewer.pal(max(3, length(levels(pca_coords$cluster))), "Set2")
# Interactive plot for HTML output
if (knitr::is_html_output()) {
plot_ly(
data = pca_coords,
x = ~PC1, y = ~PC2, z = ~PC3,
type = "scatter3d", mode = "markers",
color = ~cluster, colors = pal,
marker = list(size = 4)
) |>
layout(
scene = list(
xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")
),
legend = list(title = list(text = "Cluster"))
)
} else {
# Fallback 2D plot for non-HTML output (e.g. PDF)
fviz_cluster(
res.hcpc,
geom = "point", repel = TRUE,
show.clust.cent = TRUE,
axes = c(1, 2),
palette = pal
)
}
This code block performs Hierarchical Clustering on Principal
Components (HCPC) using the FactoMineR and
factoextra packages.
The K-means algorithm produced 5 clusters based on the elbow and
silhouette criteria, whereas HCPC (Hierarchical Clustering on Principal
Components) automatically selected 3 clusters. This discrepancy
highlights the methodological difference: K-means relies on
pre-specified k and optimises intra-cluster compactness,
while HCPC infers cluster number from hierarchical structure, aiming to
capture global data separation. These complementary results provide
insights at different levels of granularity.
pca_coords2 <- as.data.frame(pca_result$ind$coord[, 1:3])
colnames(pca_coords2) <- c("PC1", "PC2", "PC3")
pca_coords2$label <- unlabeled_use$label # class labels assigned earlier via centroid proximity
pal2 <- brewer.pal(max(3, length(unique(pca_coords2$label))), "Set2")
plot_ly(data = pca_coords2,
x = ~PC1, y = ~PC2, z = ~PC3,
color = ~label, colors = pal2,
type = "scatter3d", mode = "markers",
marker = list(size = 3)) |>
layout(title = "PCA - Colored by Class Label (Centroid Proximity)",
legend = list(title = list(text = "Class")),
scene = list(
xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")
))
# Ensure Class is a factor
annotated$Class <- as.factor(annotated$Class)
# Remove class temporarily, scale, then add back.
# Caveat: scaling the full annotated set before the split lets test-set
# statistics leak into the preprocessing; fitting the scaler on the
# training fold only would be stricter "data hygiene".
ann <- select(annotated, -Class) |> scale() |> as_tibble()
ann$Class <- annotated$Class
# Create training/test split
set.seed(82171165)
idx <- createDataPartition(ann$Class, p = 0.7, list = FALSE)
train <- ann[idx, ]
test <- ann[-idx, ]
# Train classifier (radial-basis SVM; method = "svmRadial" requires the kernlab package)
model <- train(Class ~ ., data = train, method = "svmRadial")
# Predict and evaluate
pred <- predict(model, test)
confusionMatrix(pred, test$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E F G
## A 1 0 0 0 0 0 0
## B 0 1 0 0 0 0 0
## C 0 0 4 0 1 0 0
## D 0 0 0 6 0 0 0
## E 0 0 0 0 1 0 0
## F 0 0 0 0 0 6 0
## G 1 0 0 0 0 0 6
##
## Overall Statistics
##
## Accuracy : 0.9259
## 95% CI : (0.7571, 0.9909)
## No Information Rate : 0.2222
## P-Value [Acc > NIR] : 1.014e-14
##
## Kappa : 0.9085
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E Class: F
## Sensitivity 0.50000 1.00000 1.0000 1.0000 0.50000 1.0000
## Specificity 1.00000 1.00000 0.9565 1.0000 1.00000 1.0000
## Pos Pred Value 1.00000 1.00000 0.8000 1.0000 1.00000 1.0000
## Neg Pred Value 0.96154 1.00000 1.0000 1.0000 0.96154 1.0000
## Prevalence 0.07407 0.03704 0.1481 0.2222 0.07407 0.2222
## Detection Rate 0.03704 0.03704 0.1481 0.2222 0.03704 0.2222
## Detection Prevalence 0.03704 0.03704 0.1852 0.2222 0.03704 0.2222
## Balanced Accuracy 0.75000 1.00000 0.9783 1.0000 0.75000 1.0000
## Class: G
## Sensitivity 1.0000
## Specificity 0.9524
## Pos Pred Value 0.8571
## Neg Pred Value 1.0000
## Prevalence 0.2222
## Detection Rate 0.2222
## Detection Prevalence 0.2593
## Balanced Accuracy 0.9762
Accuracy: 92.59% — the model correctly classified about 93% of the test samples. However, the 95% CI of (0.76, 0.99) is wide, reflecting the small test set (27 samples), so the point estimate should be read with caution.
Class C and G: High sensitivity but slightly lower precision — suggesting that while the model correctly detects these classes, there are some false positives.
Class A: Only one out of two Class A samples was correctly classified — low recall but high precision.
The model performs very well overall, especially considering the small sample size and class imbalance. It maintains perfect recall (1.0) for 5 out of 7 classes. Misclassifications are minimal and generally involve classes with few examples (e.g., A and E), which is expected in low-data scenarios. The classifier appears suitable for practical seed classification, although the small test set limits how confidently its generalisation to unseen data can be asserted.